Temporal Extension of Scale Pyramid and Spatial Pyramid Matching for Action Recognition
Historically, researchers in the field have spent a great deal of effort to
create image representations that have scale invariance and retain spatial
location information. This paper proposes to encode equivalent temporal
characteristics in video representations for action recognition. To achieve
temporal scale invariance, we develop a method called temporal scale pyramid
(TSP). To encode temporal information, we present and compare two methods
called temporal extension descriptor (TED) and temporal division pyramid
(TDP). Our purpose is to suggest solutions for matching complex actions that have
large variation in velocity and appearance, which is missing from most current
action representations. Experimental results on four benchmark datasets,
UCF50, HMDB51, Hollywood2 and Olympic Sports, support our approach, which
significantly outperforms state-of-the-art methods. Most notably, we achieve
65.0% mean accuracy and 68.2% mean average precision on the challenging HMDB51
and Hollywood2 datasets, absolute improvements over the state of the art of
7.8% and 3.9%, respectively.
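As an illustration of the temporal division idea described above, the sketch below pools per-frame descriptors over successively finer temporal segments and concatenates the results. The function name, the choice of mean pooling and the number of levels are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def temporal_division_pyramid(frame_features, levels=3):
    """Pool per-frame descriptors over a pyramid of temporal divisions.

    At level l the video is split into 2**l equal segments; each segment
    is mean-pooled and all pooled vectors are concatenated, so coarse
    levels describe the whole clip while fine levels keep temporal order.
    (Illustrative sketch; the paper's actual pooling/encoding may differ.)
    """
    pooled = []
    for level in range(levels):
        n_segments = 2 ** level
        # Split the time axis into n_segments (nearly) equal chunks.
        for chunk in np.array_split(frame_features, n_segments, axis=0):
            pooled.append(chunk.mean(axis=0))
    return np.concatenate(pooled)  # length D * (1 + 2 + 4 + ...)
```

Because every level is concatenated, the final vector stays comparable across videos of different lengths while still encoding where in the clip each motion occurred.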
Beyond Gaussian Pyramid: Multi-skip Feature Stacking for Action Recognition
Most state-of-the-art action feature extractors involve differential
operators, which act as high-pass filters and tend to attenuate low-frequency
action information. This attenuation introduces bias to the resulting features
and generates ill-conditioned feature matrices. The Gaussian Pyramid has been
used as a feature enhancing technique that encodes scale-invariant
characteristics into the feature space in an attempt to deal with this
attenuation. However, at the core of the Gaussian Pyramid is a convolutional
smoothing operation, which makes it incapable of generating new features at
coarse scales. In order to address this problem, we propose a novel feature
enhancing technique called Multi-skIp Feature Stacking (MIFS), which stacks
features extracted using a family of differential filters parameterized with
multiple time skips and encodes shift-invariance into the frequency space. MIFS
compensates for information lost from using differential operators by
recapturing information at coarse scales. This recaptured information allows us
to match actions at different speeds and ranges of motion. We prove that MIFS
enhances the learnability of differential-based features exponentially. The
resulting feature matrices from MIFS have much smaller condition numbers and
variances than those from conventional methods. Experimental results show
significantly improved performance on challenging action recognition and event
detection tasks. Specifically, our method exceeds the state of the art on the
Hollywood2, UCF101 and UCF50 datasets and is comparable to the state of the art
on the HMDB51 and Olympic Sports datasets. MIFS can also be used as a speedup
strategy for feature extraction, with minimal or no accuracy cost.
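The multi-skip idea above can be sketched minimally: differencing consecutive frames is a high-pass operation, while differencing at larger time skips recaptures slower, low-frequency motion. MIFS actually re-runs trajectory-based extractors at each skip; in this hedged sketch a plain frame difference stands in for the differential operator, and the skip values are arbitrary.

```python
import numpy as np

def multi_skip_stack(frames, skips=(1, 2, 4)):
    """Stack temporal-difference features computed at several frame skips.

    A skip of 1 captures fast motion only; larger skips respond to
    slower motion, so stacking the results covers actions performed at
    different speeds. (Simplified stand-in for the MIFS pipeline.)
    """
    stacked = []
    for s in skips:
        diff = frames[s:] - frames[:-s]  # temporal difference with skip s
        stacked.append(diff)
    return stacked  # one feature matrix per skip, to be pooled/encoded
```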
Strategies for Searching Video Content with Text Queries or Video Examples
The large number of user-generated videos uploaded to the Internet every day
has led to many commercial video search engines, which rely mainly on text
metadata for search. However, metadata is often lacking for user-generated
videos, making them unsearchable by current search engines. Content-based
video retrieval (CBVR) tackles this metadata-scarcity problem by directly
analyzing the visual and audio streams of each video. CBVR
encompasses multiple research topics, including low-level feature design,
feature fusion, semantic detector training and video search/reranking. We
present novel strategies in these topics to enhance CBVR in both accuracy and
speed under different query inputs, including pure textual queries and query by
video examples. Our proposed strategies were incorporated into our submission
to the TRECVID 2014 Multimedia Event Detection evaluation, where our system
outperformed other submissions on both text queries and video-example queries,
demonstrating the effectiveness of our proposed approaches.
A unified framework with a benchmark dataset for surveillance event detection
As an important branch of multimedia content analysis, Surveillance Event Detection (SED) remains a challenging task due to the high abstraction and complexity of surveillance scenes, including occlusions, cluttered backgrounds and viewpoint changes. To address the problem, we propose a unified SED framework that divides events into two categories: short-term events and long-duration events. The former can be represented as snapshots of static key-poses and embody inner-dependencies, while the latter contain complex interactions between pedestrians and show obvious inter-dependencies and temporal context. For short-term events, a novel cascade Convolutional Neural Network (CNN), HsNet, is first constructed to detect pedestrians, and the corresponding events are then classified. For long-duration events, Dense Trajectory (DT) and Improved Dense Trajectory (IDT) features are first applied to explore the temporal structure of the events; subsequently, Fisher Vector (FV) coding is adopted to encode the raw features and linear SVM classifiers are learned for prediction. Finally, a heuristic fusion scheme combines the results. In addition, a new large-scale pedestrian dataset, named SED-PD, is built for evaluation. Comprehensive experiments on the TRECVID SED test datasets demonstrate the effectiveness of the proposed framework.
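The Fisher Vector coding step mentioned above can be sketched as follows. The GMM parameters are assumed to be fitted offline (e.g. with scikit-learn's `GaussianMixture` on a sample of DT/IDT descriptors); the function name and the power/L2 normalization follow common FV practice rather than this paper's exact pipeline.

```python
import numpy as np

def fisher_vector(X, weights, means, sigmas):
    """Encode local descriptors X (N x D) with a K-component diagonal GMM.

    Concatenates the normalized gradients of the log-likelihood w.r.t.
    the GMM means and standard deviations, the usual encoding fed to a
    linear SVM. `weights` is (K,), `means` and `sigmas` are (K, D).
    """
    N, D = X.shape
    K = len(weights)
    # Soft-assign each descriptor to the K Gaussians (posteriors gamma).
    log_p = np.empty((N, K))
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        log_p[:, k] = (np.log(weights[k])
                       - 0.5 * np.sum(diff ** 2, axis=1)
                       - np.sum(np.log(sigmas[k])))
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    fv = []
    for k in range(K):
        diff = (X - means[k]) / sigmas[k]
        g_mu = gamma[:, k:k + 1] * diff             # gradient w.r.t. mean
        g_sig = gamma[:, k:k + 1] * (diff ** 2 - 1)  # w.r.t. std. dev.
        fv.append(g_mu.sum(axis=0) / (N * np.sqrt(weights[k])))
        fv.append(g_sig.sum(axis=0) / (N * np.sqrt(2 * weights[k])))
    fv = np.concatenate(fv)            # length 2 * K * D
    # Power- and L2-normalization, standard for improved Fisher Vectors.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

The resulting fixed-length vector is what the abstract's linear SVM classifiers are trained on, one video (or event window) per vector.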
Informedia E-Lamp @ TRECVID 2013: Multimedia Event Detection and Recounting (MED and MER)
We report on our system used in the TRECVID 2013 Multimedia Event Detection (MED) and Multimedia Event Recounting (MER) tasks. For MED, the system consists of four main steps: feature extraction, feature representation, detector training and fusion. In the feature extraction step, we extract more than 10 low-level, high-level and text features. These features are then represented in three different ways: spatial bag-of-words, Gaussian Mixture Model (GMM) super vectors and Fisher Vectors. In detector training and fusion, two classifiers and a weighted double fusion method are employed. The official evaluation results show that our full MED systems achieve the best scores on Ad-Hoc EK10 and EK0, and our audio systems achieve the best scores on EK100 and EK10 for both the Pre-specified and Ad-Hoc tasks. Our MER system utilizes a subset of the features and detection results from the MED system, from which the recounting is generated.
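The weighted fusion step mentioned above can be sketched minimally as a weighted average of per-feature classifier scores. "Double fusion" combines both early-fused and single-feature outputs; this hedged sketch shows only the weighted late-fusion part, with z-normalization and uniform example weights standing in for the weights the actual system learns on validation data.

```python
import numpy as np

def weighted_late_fusion(score_lists, fusion_weights):
    """Fuse detection scores from several classifiers with given weights.

    Each entry of `score_lists` holds one classifier's scores for the
    same set of test videos. Scores are z-normalized first so that
    classifiers with different output scales contribute comparably.
    (Illustrative sketch; the MED system learns its weights offline.)
    """
    fused = np.zeros(len(score_lists[0]), dtype=float)
    for scores, w in zip(score_lists, fusion_weights):
        s = np.asarray(scores, dtype=float)
        s = (s - s.mean()) / (s.std() + 1e-12)  # z-normalize per classifier
        fused += w * s
    return fused / sum(fusion_weights)
```

Ranking test videos by the fused score then gives the final detection list for each event.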